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Syntax-directed  Translation 
of  Concurrent  Programs 
into  Self-timed  Circuits1 

Steven  M.  Burns  and  Alain  J.  Martin 

Computer  Science  Department 
California  Institute  of  Technology 
Pasadena,  CA  91125 


We  present  a  method  for  the  automatic  compilation  of  concurrent  pro¬ 
grams  into  self -timed  circuits.  The  compilation  is  directed  by  the  syn¬ 
tax  of  the  source  language  and  produces  linear-sized  implementations 
of  arbitrary  concurrent  programs.  We  have  constructed  a  compiler 
which  performs  this  translation. 

An  automatic  method  for  synthesizing  a  digital  circuit  from  an  ab¬ 
stract  specification  represents  an  important  step  forward  in  managing 
the  complexity  of  VLSI  system  design.  By  allowing  the  designer  to 
think  in  terms  of  high-level  programs  instead  of  detailed  circuits,  sys¬ 
tems  can  be  designed  in  less  time  with  fewer  errors.  In  fact,  if  formal 
proofs  are  used  to  verify  that  the  program  implements  the  system 
specification,  the  circuits  derived  are  correct  by  construction. 

To  take  full  advantage  of  the  inherently  concurrent  nature  of  dig¬ 
ital  circuits,  we  have  chosen  a  variant  of  Communicating  Sequential 
Processes  (CSP)[4]  as  the  source  language  for  our  synthesis  method. 
While  self-timed  circuits[10]  offer  several  advantages  over  clocked  cir¬ 
cuits,  we  chose  them  as  our  synthesis  target  mainly  because  of  the 
composition  properties  they  possess.  Self-timed  sub-circuits  may  be 
composed  together  to  form  circuits  of  arbitrary  size  without  the  haz¬ 
ards  associated  with  clock  distribution  in  large  synchronous  designs. 

In  this  paper,  we  present  a  mechanical  method  for  transforming  a 
concurrent  program  into  a  semantically  equivalent  self-timed  circuit. 
This  method  is  based  on  the  manual  circuit-compilation  techniques  of 
[6,8],  but  the  translation  mechanism  follows  standard  syntax-directed 
techniques.  We  describe  the  necessary  translation  rules  and  apply 
these  to  an  example.  This  transformation  method  allows  any  program 

1in  Advanced  Research  in  VLSI:  Proceedings  of  the  5th  MIT  Conference,  MIT 
Press,  pp.  35-50  (1988) 
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(process) 

(channel) 

(port) 

(var) 

(sequence) 

(statement) 

(gca) 

<gc) 

(responses) 

(response) 

(expr) 

(conjunct) 

(primary) 


(  (process)  {  I  I  (process)  }  )  {  (channel)  } 

|  {  (port)  }  {  (var)  }  (sequence) 
channel  (  (pNAME)  ,  (pNAME)  ) 

( passive  |  active  )  (pNAME)  (  (rINT)  ,  (rINT)  ) 
boolean  (vNAME)  =  (  true  |  false  ) 

(statement)  {  ;  (statement)  } 

skip  |  (vNAME)  (  up  |  down )  |  [  (gcs)  ]  |  *  [  (gcs)  ] 

|  (pNAME)  (  (vINT)  )  :  [  (responses)  ] 

(gc)  {  i  (gc)  } 

(expr)  — >  (sequence) 

(response)  {  I  (response)  } 

(vINT)  — >  (sequence) 

(conjunct)  {  or  (conjunct)  } 

(primary)  {  and  (primary)  } 

(vNAME)  |  probe  (pNAME)  |  true  |  false 
|  not  (primary)  |  (  (expr)  ) 


Figure  1:  BNF  for  Source  Language 


in  the  source  language  to  be  compiled  into  a  self-timed  circuit,  and 
furthermore  constructively  proves  that  the  size  of  the  resulting  circuit 
is  linearly  related  to  the  size  of  the  source  program.  The  method  has 
been  automated  and  we  compare  the  designs  produced  by  the  compiler 
with  designs  produced  by  manual  methods. 


1  Source  Language 

The  source  language  is  based  on  CSP[4]  with  the  addition  of  the 
probe[ 9]  and  a  new  communication  construct.  A  complete  description 
of  the  language  syntax  is  given  in  Figure  1.  A  program  in  this  lan¬ 
guage  consists  of  a  set  of  sequential  processes  with  interconnecting 
channels.  Associated  with  each  sequential  process  is  a  set  of  ports, 
a  set  of  private  variables,  and  a  list  of  statements  to  be  executed  se¬ 
quentially.  Ports  that  do  not  connect  to  another  process  connect  to 
the  environment. 

Only  boolean  variables  are  allowed.  Variables  are  changed  by  as¬ 
signment  to  true  (x  up)  or  to  false  (x  down).  The  selection  ([(gcs) 3 ) 
and  repetition  (*  [  (gcs)  3 )  constructs  are  based  on  guarded  commands. 
We  use  *  [(sequence)]  as  an  abbreviation  for  *[true — > (sequence) 3 . 

Synchronization  between  two  processes  is  accomplished  by  zero- 
slack  communication  actions  across  channels  denoted  by  pairs  of  ports. 
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Of  the  two  ports  which  make  up  a  channel,  one  is  declared  active  and 
the  other  is  declared  passive.  The  process  which  owns  the  passive  port 
may  determine  whether  the  other  process  is  waiting  for  a  communica¬ 
tion  on  this  channel  by  evaluating  a  boolean  condition  called  a  probe. 
Probes  may  be  used  in  arbitrary  boolean  expressions. 

Though  concurrently  operating  processes  may  not  share  variables, 
processes  may  communicate  data  by  exchanging  values  from  small 
sets  during  a  synchronization  action.  The  communication  construct 
provides  a  syntax  allowing  differing  sequences  of  commands  to  be 
executed  based  on  the  value  received  during  a  communication.  When 
declaring  a  port,  we  specify  both  the  send  and  receive  sets  of  values — 
each  set  being  represented  by  a  single  integer.  For  example, 

passive  L(3,2) 

declares  a  passive  port  L  with  send  set  {0,1,2}  and  receive  set  {0,  l}. 
The  communication  action  (on  the  same  port,  L) 

L(l):[  0  — >  x  down  1  1  — >  x  up  ] 

sends  the  value  1  while  simultaneously  receiving  either  a  1  or  a  0.  If 
a  0  is  received,  x  is  set  to  false;  if  a  1  is  received,  x  is  set  to  true. 

The  output  value  specification  of  the  communication  action  may 
be  suppressed  if  the  port  has  only  one  send  value.  Similarly,  the 
receive  value  selection  need  not  be  specified  if  the  port  has  only  one 
receive  value. 

We  illustrate  the  programming  language  and,  later,  the  compila¬ 
tion  procedure  with  the  following  example. 

Routing  Automata 

By  interconnecting  processes  of  two  different  types,  we  can  construct 
worm-hole  routing  systems  [3]  that  allow  processors,  connected  in  a 
variety  of  different  network  structures,  to  communicate.  The  switch 
process  (diamond),  together  with  an  arbiter  process  (circle),  are  suffi¬ 
cient  to  implement  deterministic  routing  systems  for  hypercubes,  tori, 
meshes  and  other  networks  in  which  the  path  of  the  message  can  be 
determined  before  the  message  is  sent.  Figure  2  shows  a  building  block 
configuration  for  sending  messages  up  and  right  in  a  two-dimensional 
mesh.  Each  processor  injects  messages  (strings  of  bits)  into  the  sys¬ 
tem  through  port  P  and  extracts  messages  through  port  Q.  The 
processor  sending  a  message  prefixes  the  message  with  a  string  of  bits 
specifying  the  path  the  message  will  travel  through  the  network.  A 
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Figure  2:  Building  Block  Interconnections  for  a  One-way  Mesh 


special  token  is  used  to  mark  the  end  of  the  message.  The  packet — the 
combination  of  the  header,  the  message,  and  the  trailing  token — cuts 
a  path  through  the  mesh,  first  to  the  right  and  then  up,  to  its  desti¬ 
nation.  The  arbiter  transmits  an  entire  packet  contiguously  from  an 
input  port  to  the  output  port.  Packets  from  different  input  ports  are 
not  interleaved.  The  switch  consumes  the  first  bit  of  the  packet,  and, 
based  on  whether  the  bit  is  one  or  zero,  passes  the  remainder  of  the 
message  out  through  either  port  S  or  port  T. 

Figure  3  shows  programs  for  these  two  sequential  processes.  The 
switch  process  first  communicates  through  port  X,  and  raises  one  of 
two  boolean  flags,  s  or  t.  If  s  was  raised,  the  process  acts  as  a 
one-place  buffer,  sending  the  value  received  by  port  X  out  through 
port  S,  until  the  special  trailing  value  2  is  received;  in  which  case, 
flag  s  is  lowered,  and  the  repetition  terminates.  If  t  was  raised,  the 
process  behaves  in  a  similar  manner,  except  T  is  used  as  the  output 
port.  The  arbiter  process  performs  fair  arbitration[7]  between  the  two 
input  streams  by  first  checking  if  an  input  is  pending  on  port  A  and, 
if  it  is,  then  acting  as  a  one-place  buffer  with  output  port  Y  until  the 
trailing  value  is  received.  The  port  B  will  be  chosen  next  if  an  input  is 
pending,  ensuring  that  each  port  is  given  the  opportunity  to  transmit 
its  packet. 
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passive  X(l,3)  active  S(3,l)  active  T(3,l) 
boolean  s  =  false  boolean  t  =  false 
*[  X:  [  l-->  s  up  |  0-->  t  up  ] ; 

*[  s  -->  X:  [  0-->  S(0)  |  l-->  S(l)  I  2— >  S (2)  ;  s  down  ] 

|  t  — >  X:  [  0— >  T(0)  |  1— >  T(l)  I  2— >  T(2);  t  down  ] 

]  ] 

process  switch (X.S.T) 


passive  A(i,3)  passive  B(l,3)  active  Y(3,l) 

boolean  a  =  false  boolean  b  -  false 

*[  [  probe  A  -->  a  up  I  not  probe  A  — >  skip  ] ; 

*[  a  -->  A:  [  0-->  YCO)  I  l-->  Y(l)  |  2— >  Y<2)  ;  a  down  ]]; 

[  probe  B  -->  b  up  I  not  probe  B  -->  skip  ] ; 

*[  b  -->  B:  [  0-->  Y(0)  I  l-->  Y(l)  I  2— >  Y(2)  ;  b  down  ]] 

] 

process  arbiter(A ,B , Y) 


Figure  3:  Programs  for  the  Switch  and  Arbiter  Processes 


2  Target  Language 

The  target  of  the  compilation  is  a  self-timed  circuit — a  set  of  circuit 
variables  (nodes)  interconnected  by  a  set  of  operators  (gates).  These 
circuits  are  designed  to  function  correctly  regardless  of  the  internal 
delays  of  the  operators.  The  target  operator  types  include  the  com¬ 
binational  operators  WIRE,  AND,  and  OR;  and  the  state-holding 
operators  shown  in  Figure  4.  Each  operator  is  defined  in  terms  of  a 
set  of  production  rules[6,8].  A  production  rule  is  a  simple  transition 
rule  of  the  form  G  i — *■  S;  where  G  is  a  boolean  expression  and  S  is  an 
assignment  to  true  or  false.  All  references  to  a  circuit  variable  are  as¬ 
sumed  to  have  the  same  value  (isochronic  forks)  [2,6].  A  synchronizer, 
which  cannot  be  represented  in  terms  of  production  rules,  is  included 
to  allow  the  implementation  of  programs  with  negated  probes[7].  The 
synchronizer,  as  well  as  the  other  operators,  have  been  implemented 
as  CMOS  standard  cells. 

Self-timed  circuit  implementations  of  concurrent  programs  are  gen¬ 
erated  by  implementing  each  sequential  process  as  a  separate  sub¬ 
circuit.  The  sub-circuits  are  connected  (by  wire  operators)  only  to 
implement  communication  actions.  The  simultaneity  required  in  the 
zero-slack  communications  is  implemented  using  a  four-phase  hand- 
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Figure  4:  State-holding  Operators 


shaking  protocol.  In  order  to  implement  the  exchange  of  data,  the 
usual  request/acknowledge  pair  of  wires  is  replaced  by  one  wire  for 
each  send  value  and  one  wire  for  each  receive  value.  For  example,  us¬ 
ing  this  unary  encoding  scheme,  a  channel,  connecting  the  active  port 
Y  of  the  arbiter  process  and  the  passive  port  X  of  the  switch  process, 
is  implemented  using  four  wire  operators:  three  pointing  from  Y  to  X 
and  one  pointing  from  X  to  Y  (Figure  5). 


arbiter  switch 


Figure  5:  Implementation  of  a  CSP  Communication 


The  delay-insensitive  nature  of  interprocess  communication  al¬ 
lows  concurrent  programs  with  any  number  of  processes  to  be  im¬ 
plemented.  In  implementing  individual  processes,  we  shall  exploit 
this  property,  through  a  technique  known  as  process  decomposition. 
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3  Syntax-directed  Translation 

The  translation  from  a  source  program  into  a  target  circuit  is  per¬ 
formed  using  standard  syntax-directed  techniques.  We  derive  rules  for 
generating  sub-circuits  that  correspond  to  each  syntactic  construct  of 
the  source  language,  and  for  composing  these  sub-circuits  to  generate 
circuits  for  arbitrary  programs.  These  translation  rules  are  derived 
by  applying  the  process  decomposition  transformat  ion  [6]. 

3.1  Process  Decomposition 

An  arbitrary  program  statement  /3  can  be  replaced  by  a  single  active 
communication  C'  and  a  new  sub-process  implementing  j3. 

or,P\ 7  >  («;C";7  ||  *[[C  — ►  P‘,C]]) 

The  probe  C,  of  the  connected  passive  port,  is  used  to  guard  the  ex¬ 
ecution  of  the  statement  /?.  While  process  decomposition  introduces 
a  new  sub-process,  it  does  not  add  concurrency  to  the  implementa¬ 
tion;  the  active  communication  C1  cannot  finish  until  its  correspond¬ 
ing  passive  communication  C  does,  and  thus  the  strict  sequencing 
a; /?;7  is  maintained.  The  original  process  and  the  new  sub-process 
may  now  share  variables  and  ports,  since  processes  and  sub-processes 
are  never  active  concurrently.  The  variables  and  ports  of  a  process 
are  implemented  by  circuit  variables,  which  must  be  distributed  to 
many  sub-circuits.  We  assume  these  circuit  variables  are  distributed 
by  isochronic  forks.  This  assumption  may  be  relaxed  by  suspending 
a  sequential  process  until  distribution  of  the  variable  to  every  refer¬ 
ence  point  is  detected.  This  refinement,  described  in  [2],  reduces  all 
isochronic  forks  to  size  two. 

Communications  across  channels  introduced  by  process  decompo¬ 
sition  have  simple  implementations  because  the  handshaking  actions 
of  the  passive  communication  may  be  interleaved  with  the  execution 
of  /?.  Wire  operators  are  typically  sufficient  to  implement  this  special 
form  of  the  passive  communication  action. 

We  apply  process  decomposition  to  generate  translation  rules  for 
the  BNF  rules  naming  (statement).  Every  occurrence  of  (statement) 
on  the  right-hand  side  of  a  BNF  rule  is  replaced  by  an  active  commu¬ 
nication.  For  every  BNF  rule  with  (statement)  on  its  left-hand  side,  a 
new  sub-process,  protected  by  a  passive  communication,  is  generated 
to  implement  the  right-hand  side. 
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We  generate  the  translation  rule  for  the  sequencing  constructby 
the  following  series  of  transformations: 

*[[C  — >  (statement); C]\ 

>  {BNF  rule } 

*[[C  — *  (statement) x ;  (statement)2;  C]] 

>  { Process  decomposition } 

(  *[[£_— 

||  *[[CX — >•  (statement)  1;  Cx]] 

|j  *[[C2 — >  (statement^ ; C2j] 

) 

The  first  sub-process  is  of  fixed  size  and  may  be  compiled  into  a  self- 
timed  circuit  using  the  techniques  in  [6,8].  Circuits  for  the  other  two 
sub-processes  are  generated  recursively  by  applying  the  translation 
rules.  Figure  6  displays  the  translation  rule  in  circuit  form.  The  sub¬ 
circuit  enclosed  in  box  D  is  used  in  the  implementation  of  several 
other  constructs.  In  the  following,  we  shall  refer  to  instances  of  this 
circuit  as  instances  of  a  D-element. 

The  other  translation  rules  that  use  (statement)  can  be  derived 
similarly.  Complete  derivations  are  described  in  [2].  Figure  7  shows 
the  circuit  representations  of  these  rules.  The  leaf  sub-processes 
are  interconnected  by  several  circuit  variables.  The  named  variables 
within  the  dashed  boxes  are  connected  to  form  the  complete  circuit. 
Each  assignment  sub-circuit  adds  a  new  input  to  the  flip-flop  imple¬ 
menting  the  variable.  The  output  of  the  flip-flop  is  distributed  to  all 
sub-circuits  using  the  variable.  Each  port  usage  adds  a  new  input  to 
an  OR  operator  merging  its  output  transitions.  A  circuit  variable  is 
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(process) 


[  (gcs)  ] 


I _ ....... 

(active) 


l _ 

(passive) 


Figure  7:  Circuits  Implementing  Statements 
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L 


Figure  8:  Translation  of  the  switch  Process 


generated  to  directly  implement  each  probe  by  merging  together,  with 
an  OR  operator,  each  input  wire  of  a  passive  port.  The  translation 
rule  for  (process)  uses  a  global  reset  variable  g.  When  g  is  false,  no 
process  is  active,  and  thus  all  state-holding  operators  can  be  reset  to 
an  initial  state. 

Figure  8  shows  the  first  few  translation  rules  applied  to  the  switch 
process.  The  complete  circuit  is  shown  in  Figure  9. 

3.2  Guarded  Command  Evaluation 

Data  channels  can  be  used  in  process  decomposition,  allowing  a  sub¬ 
process  to  communicate  the  result  of  an  evaluation  to  its  parent  pro¬ 
cess.  We  apply  this  general  form  of  decomposition  to  derive  transla¬ 
tion  rules  for  guarded  command  evaluation. 

The  implementation  of  a  guarded  command  set  (g.c.s.)  returns  one 
of  two  values:  1  if  a  guard  evaluated  to  true  and  its  corresponding 
command  was  executed,  and  0  if  no  guard  evaluated  to  true.  The 
repetition  statement  reevaluates  the  g.c.s.  if  1  is  returned.  Using  busy- 
waiting,  the  selection  statement  reevaluates  the  g.c.s.  if  0  is  returned. 

The  g.c.s.  process  is  decomposed  into  a  process  for  guard  set  eval¬ 
uation;  and  a  control  process  which  sequences  guard  evaluation  and 
the  associated  command  execution.  The  guard  set  process  selects  at 
most  one  true  guard,  returning  a  value  corresponding  to  the  selected 
guard.  This  process  also  detects  when  no  guard  evaluates  to  true,  and 
returns  the  value  0.  The  control  process  explicitly  introduces  the  state 
variables  necessary  to  distinguish  between  the  program  state  just  be¬ 
fore  guard  evaluation  and  the  program  states  just  after  a  guard  has 
been  selected,  thus  allowing  guard  evaluation  to  complete  before  sub¬ 
sequent  statements  change  variables  and  probes  used  in  the  guards. 

We  concentrate  now  on  evaluating  an  arbitrary  guard  set.  The  se- 
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Figure  9:  Optimized  Circuit  Implementing  the  switch  Process 
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m antics  of  the  language  does  not  specify  the  order  in  which  the  guards 
are  to  be  evaluated.  They  may  be  evaluated  in  any  order  or  all  si¬ 
multaneously.  We  introduce  three  schemes  for  implementing  guards: 
sequential ,  concurrent-all,  and  concurrent-one.  The  sequential  scheme 
imposes  a  total  order  on  the  guards  and  generates  a  circuit  which 
conditionally  evaluates  each  guard  and  each  connective  with  a  guard. 
Evaluation  of  the  guard  set  terminates  when  a  guard  evaluates  to 
true.  In  the  concurrent-all  scheme  all  guards  are  evaluated  concur¬ 
rently.  Evaluation  of  the  guard  set  finishes  after  all  guards  have  been 
evaluated.  In  the  concurrent-one  scheme  all  guards  are  evaluated  con¬ 
currently,  but  the  guards  and  the  sub-expressions  within  a  guard  have 
been  strengthened  so  that  only  one  path  through  the  evaluation  cir¬ 
cuit  will  become  active.  For  all  three  schemes,  we  must  ensure  mutual 
exclusion  among  the  guards  and  generate  the  all-false  value. 

3.2.1  Sequential  Guard  Evaluation 

The  guards  are  evaluated  one  by  one  until  either  one  evaluates  to 
true  or  the  last  guard  evaluates  to  false.  Mutual  exclusion  among  the 
guards  is  accomplished  implicitly  by  this  construction. 

Implementations  of  complex  expressions  are  built  up  from  imple¬ 
mentations  of  simple  ones.  Primitive  circuits  exist  for  evaluating 
variables  and  probes.  We  use  conditional  evaluation  to  implement 
the  logical  connectives.  Negation  is  accomplished  by  exchanging  the 
meanings  of  the  false  and  true  values.  We  show  the  translation  rules 
for  sequential  guard  evaluation  in  Figure  10. 

If  probes  are  used,  a  further  transformation  is  required.  Because 
the  value  of  a  probe  may  change  from  false  to  true  while  an  expression 
is  being  evaluated,  all  probes  used  more  than  once  in  a  guard  set  are 
evaluated  and  assigned  to  a  variable  before  the  guard  set  itself  is 
evaluated.  Probes  within  these  expressions  are  changed  to  refer  to 
the  corresponding  stable  variable. 

The  sequential  scheme  is  applied  to  implement  the  repetition  con¬ 
struct  of  the  switch  process  (see  outlined  sub-circuit  in  Figure  9).  The 
variable  s  is  evaluated,  and,  if  false,  t  is  evaluated.  If  both  are  false, 
an  all-false  signal  is  returned. 

3.2.2  Concurrent-all  Guard  Evaluation 

All  guards  can  be  evaluated  concurrently  if  each  guard  evaluation 
completes  before  the  result  for  the  entire  guard  set  is  returned.  If  the 
circuit  does  not  wait  for  all  the  evaluations  to  complete,  subsequent 
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Figure  10:  Circuits  Implementing  Sequential  Guard  Evaluation 


evaluations  may  fail  because  internal  variables  of  the  implementation 
have  not  been  reset  to  a  required  state. 

In  the  concurrent-all  scheme,  all  variable  and  probe  evaluations  for 
the  entire  guard  set  are  started  simultaneously.  Logical  connectives 
are  evaluated  by  first  evaluating  the  two  sub-expressions,  and  then 
applying  the  logical  operation  to  the  results.  Sub-expressions  may  be 
shared  among  the  guards,  producing  more  efficient  implementations. 
The  number  of  probe  references  per  guard  set  can  be  reduced  to  one, 
thereby  eliminating  the  need  to  store  each  probe  in  a  variable. 

The  exclusion  transformation  must  be  performed  explicitly  to  use 
the  concurrent-all  evaluation  scheme.  Each  guard  Ei  is  strengthened 
with  A}=i  ~'Ej,  ensuring  that  no  two  guards  will  evaluate  to  true.  An 
all-false  guard  A"=i  is  also  created.  By  factoring  common  sub¬ 
expressions,  the  0(n2)  increase  in  guard  size  can  be  reduced  to  adding 
0(n)  new  and  connectives  and  0(n )  nesting  levels.  The  transforma¬ 
tion  also  can  be  performed  by  a  parallel  prefix  network  [5],  leading 
to  0(n log  n)  new  and  connectives,  but  with  O(logn)  nesting  levels, 
resulting  in  the  asymptotically  fastest  evaluation  of  the  three  schemes. 

3.2.3  Concurrent- one  Guard  Evaluation 

As  in  the  concurrent-all  scheme,  all  guards  are  evaluated  concurrently, 
but  instead  of  adding  circuitry  to  wait  for  all  the  guards  to  evaluate, 
the  guards  are  transformed  so  that  only  one  path  through  the  eval¬ 
uation  circuit  is  active  at  a  time.  The  guard  set  finishes  evaluation 
when  the  single  guard  becomes  true. 
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Each  guard  is  transformed  into  AND-OR  form.  The  AND  terms 
are  further  strengthened  until  exclusive  evaluation  of  each  term  is 
ensured.  While  exponential  blow-up  may  occur  when  transform¬ 
ing  pathological  expressions,  this  method  produces  the  smallest  and 
fastest  implementations  of  most  small  expressions. 

This  scheme  cannot  be  used  to  implement  all  guard  sets.  Negated 
probes  cannot  be  used,  and  if  many  variables  axe  used  in  the  guard  set, 
AND-operators  with  several  inputs  are  required.  However,  selection 
statements  with  only  positive  probes  may  be  implemented  without 
the  busy-waiting  iteration  of  the  previous  schemes. 

4  Optimizations 

Simple  optimizations  greatly  improve  the  compiled  circuits.  Peephole 
optimization  is  applied  to  the  target  circuits  by  removing  operators 
that  can  be  shown  redundant  by  a  local  analysis  of  the  circuit.  Other 
optimizations  are  applied  at  the  program  level.  They  take  the  form  of 
proving  that  the  program  satisfies  some  invariant  condition,  and  then 
using  this  invariant  to  simplify  the  implementation  of  a  construct. 

As  an  example,  the  explicit  sequencing  (D-element)  between  guard 
evaluation  and  the  execution  of  the  first  statement  of  a  correspond¬ 
ing  command  can  be  removed  if  the  first  statement  of  a  command 
sequence  does  not  change  the  value  of  any  guard.  This  is  always  the 
case  if  the  guard  set  consists  of  only  constants  or  if  the  first  statement 
is  i)  a  skip,  ii)  an  assignment  to  a  variable  not  named  in  the  guard 
set,  or  iii)  a  communication  and  no  probes  are  named  in  the  guard 
set. 

After  applying  optimizations  of  these  forms,  the  switch  process 
compiles  into  the  circuit  shown  in  Figure  9.  Notice  the  removal,  from 
the  non-optimized  compilation  started  in  Figure  8,  of  the  initialization 
operators  and  the  D-element  protecting  the  constant  guard  true. 


5  Compiler  and  Performance 

We  have  implemented  the  translation  and  optimization  rules  described 
in  this  paper.  The  compiler  consists  of  approximately  800  PROLOG 
clauses  and  is  based  on  the  program-to-machine-code  compiler  de¬ 
scribed  in  [11].  PROLOG  provides  an  excellent  environment  for  de¬ 
veloping  compilers  and,  in  our  case,  its  unification  procedure  provides 
a  simple  means  of  representing  and  composing  circuits. 
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Figure  11:  Comparison  of  Manual  and  Compiled  Designs 


The  compiled  circuits  axe  typically  no  more  than  twice  the  size 
of  those  derived  by  hand.  In  Figure  11,  a  rough  comparison  between 
the  number  of  operators  is  shown  for  a  variety  of  fabricated  hand 
designs.  The  extra  operators  needed  in  the  compiled  circuits  can  be 
explained  by  two  interrelated  factors:  The  current  source  language 
cannot  represent  the  interleaving  of  statements  within  a  sequential 
process  (reshuffling  of  handshaking  expansions) ,  and  the  current  pro¬ 
gram  level  optimizer  does  not  detect  all  the  cases  where  explicit  se¬ 
quencing  can  be  removed.  We  expect  to  narrow  this  gap  in  the  future. 

Using  3jum  SCMOS,  we  have  fabricated  several  circuits  derived 
manually  from  concurrent  programs.  We  have  used  either  a  standard¬ 
cell  place-and-route  preprocessor  for  the  MAGIC  design  tool,  or  the 
MOSIS  FUSION  service[l],  to  generate  layout  from  an  electrically 
optimized  operator-level  description  of  the  circuit.  Rudimentary  elec¬ 
trical  optimization  is  performed  individually  on  each  circuit  variable 
by  inserting  appropriately  sized  drivers  to  minimize  the  transition 
time  of  each  operator.  Such  simple  techniques  have  met  with  good 
results;  in  an  early  implementation  of  a  mesh  router  (Figure  2),  the 
transition  time  of  a  typical  operator  was  2.3ns,  resulting  in  a  pad- 
to-pad  propagation  delay  of  45ns  for  an  arbiter  and  a  switch  process 
connected  in  series.  We  expect  similar  electrical  performance  from 
chips  generated  by  the  compiler. 

The  translation  method  produces  correct,  self-timed  implementa¬ 
tions  of  arbitrarily  large  concurrent  programs,  and  because  each  trans¬ 
lation  rule  is  of  fixed  size,  the  size  of  an  implementation  is  no  worse 
than  linearly  related  to  the  size  of  the  source  program.  The  trans¬ 
lation  method  and  the  compiler  provide  a  constructive  proof  that  a 
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design  methodology  based  on  programs  instead  of  circuits  is  not  in¬ 
herently  inefficient  j  and  thus  represents  a  practical  approach  to  the 
design  of  VLSI  systems. 
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